Considering faulty behaviors: message loss, out-of-order delivery, component failures
General consideration of faults and fault recovery
To design systems that can deal with faults, one first has to consider what kinds of faults should be handled by the system. Then it is important to determine how such faults can be detected and what exceptional error handling should be foreseen. In this context, one distinguishes between the following three actions:
- fault detection (often the word "error" is used; the expression "an error is detected" means that a deviation from the requirements has been detected)
- fault localization (diagnostics), sometimes called fault isolation: this means finding the cause of the error
- fault recovery (if possible), that is, reorganizing the system so that it performs the required task in spite of the fault. In the case of a hardware fault, this involves either the repair of the faulty component, its replacement (which takes time), or redundant hardware which could take over.
It is important to distinguish different failure modes:
- Fail-Safe: in case of failure no bad result will occur - the component simply stops doing anything. If it is a computer, one would say it crashes, that is, it does not respond any more to any input.
Examples: a car does not start; a tuning fork does not produce any sound. Sometimes one distinguishes between
- Fail-stop failure: The failed component remains silent until the fault is recovered.
- Intermittent failure: The component may start working again after some time - without any external recovery actions.
- Byzantine: the component produces wrong results (possibly malicious). In this failure mode, no assumption can be made on the behavior of the faulty component. For instance, it may respond to inputs as usual, however, the parameters of the output messages may be wrong. This includes the case that the component was hijacked and is now used by an intruder to corrupt a distributed system as a whole. Examples: the steering wheel of a car does not control the tires (and the car collides with a tree); the tuning fork produces a wrong pitch.
Fail-safe failures are much easier to detect, localize and recover from. A stand-by unit is sufficient, e.g. if the tuning fork does not produce any sound, use the stand-by unit. The total number of units available (n) must be larger than the number of units that may fail (f): n > f
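The stand-by idea can be sketched as follows (a minimal illustration with hypothetical function names; a fail-stop unit produces no result rather than a wrong one):

```python
def call_with_standby(units):
    """Try each unit in turn; a fail-stop unit raises an exception
    instead of producing a (possibly wrong) result. With n units and
    at most f fail-stop failures, n > f guarantees a working unit."""
    for unit in units:
        try:
            return unit()
        except RuntimeError:
            continue  # unit has failed silently; switch to the stand-by
    raise RuntimeError("all units failed (n > f violated)")

def working():
    return "A440"           # e.g. the tuning fork sounds correctly

def failed():
    raise RuntimeError      # fail-stop: no output at all

print(call_with_standby([failed, working]))  # -> A440
```

Note that this only works because a fail-safe failure is detectable by the absence of a result; a Byzantine unit would return a wrong value without raising any exception.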
A Byzantine failure can only be detected by comparing the result with the results produced by other components. This leads to the triple redundancy design: in the case of highly reliable systems, one often uses triple redundancy, that is, three identical components that perform the task in parallel. At the end of an operation, a comparison unit compares the three results obtained, and if they are not identical, one can identify the faulty component (under the assumption that only a single unit fails at a time). One then uses the result of the other two components (the system is fault-tolerant for a single fault) and tries to replace the faulty component as fast as possible (before the next component may fail). We have n > f + 1 under the assumption that different failing units will not produce the same wrong result; in general we have n > 2*f. This assumes an absolutely reliable comparison unit. If there is no centralized comparison unit, but a distributed system, we need n > 3*f
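The comparison unit in the n > 2*f case amounts to a majority vote; a minimal sketch:

```python
from collections import Counter

def vote(results):
    """Majority vote over n redundant results. Masks up to f Byzantine
    faults when n > 2*f, assuming a reliable comparison unit and that
    faulty units do not all agree on the same wrong value."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority - too many faulty units")
    return value

# Triple redundancy (n = 3, f = 1): the faulty unit is outvoted,
# and by comparing each result against the majority value one can
# identify which unit to replace.
print(vote([42, 42, 17]))   # -> 42
```

With three identical results there is nothing to do; with one deviating result the other two win the vote, exactly as described above.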
Here are some other important concepts:
- Fault tolerance (when a fault can be recovered)
- Fault resilience: A fault may have an impact on the performance of the system, but the important functions will continue to be offered (e.g. the ABP is resilient for erroneous and lost messages).
- Fail-safeness: A system is fail-safe if it remains safe in the case of a fault. This means that no catastrophic error will occur. A "catastrophe" is a situation that should never occur, such as the explosion of a nuclear reactor or the opening of the doors of a train in motion.
Faults in distributed systems
One characteristic of distributed systems, as compared with parallel systems, is the fact that one has to assume that some faults may occur within the system. Therefore the system has to be designed to recover from such faults. For example, the purpose of the first communication protocols, such as the ABP, was to recover from faults in message transfer, either messages delivered with transmission errors or messages completely lost. One has to deal with faults related to message transmission and component failures (see below).
Design strategies for faults related to message transmission
- Messages with transmission errors: Additional message parameter(s) introduce redundancy (e.g. cyclic redundancy checks of Link Layer protocols), and any message with a detected transmission error is considered lost. NOTE: This means that the Byzantine fault of a transmission error is transformed into the fail-safe fault of message loss.
- Message loss: (a) Design the protocol such that the sender, after having sent the message, expects to receive (after some time) a message which is only received if the sent message has been properly processed by the receiver (and possible other parties involved in the protocol). (b) Introduce a time-out mechanism in the sender which will generate a time-out if the message mentioned under (a) is not received in time.
- The selection of an appropriate time-out period is critical - some messages should not be received twice since they have side effects which should not be duplicated.
- Ideally, the protocol should still work correctly when the time-out period is chosen very short and the message processing becomes duplicated.
- The ABP is an interesting example: In this case, a too short time-out period leads to an operation mode with duplicated messages (less efficiency), however, the logical correctness of reliable data transmission remains valid.
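The two strategies above (checksums turning transmission errors into losses, and time-out-driven retransmission recovering from losses) can be sketched together. This is a simplified simulation, not the ABP itself: the channel, loss rates and function names are all invented for illustration.

```python
import random
import zlib

random.seed(1)  # deterministic simulation

def send(payload):
    """Attach a checksum (redundancy) so that transmission errors
    can be detected at the receiver."""
    return payload, zlib.crc32(payload)

def unreliable_channel(frame):
    """Simulated channel: may lose or corrupt the frame."""
    r = random.random()
    if r < 0.3:
        return None                      # message loss
    if r < 0.5:
        payload, crc = frame
        return payload + b"!", crc       # transmission error
    return frame

def receive(frame):
    """A frame with a bad checksum is treated as lost: the Byzantine
    fault 'transmission error' becomes the fail-safe fault 'loss'."""
    if frame is None:
        return None
    payload, crc = frame
    return payload if zlib.crc32(payload) == crc else None

def reliable_send(payload, max_tries=50):
    """Retransmit until delivery is confirmed. Here the confirmation
    stands in for the acknowledgement message; a missing confirmation
    plays the role of a time-out firing at the sender."""
    for _ in range(max_tries):
        delivered = receive(unreliable_channel(send(payload)))
        if delivered == payload:
            return delivered             # acknowledged in time
        # else: time-out fires and the sender retransmits
    raise RuntimeError("too many consecutive losses")

print(reliable_send(b"hello"))
```

As in the ABP discussion above, retransmitting too eagerly only costs efficiency (duplicate transmissions), not correctness, provided duplicates are recognized, e.g. by the alternating bit.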
Issues with message delivery
Assuming that there are N system components that communicate with one another through the exchange of messages, the following situations may characterize the message transmission service from component A to component B (here we assume that transmission errors have been detected by redundancy parameters and lead to message loss). When designing a protocol for some application, one has to consider which of these cases applies and design the protocol accordingly.
- Reliable in-order transmission: All messages are correctly received by B in the same order they were sent by A.
- This is usually the case if the components establish a logical connection and use a reliable transport protocol, such as TCP.
- However, the transport may fail (because of problems within the networking infrastructure). In this case, one does not know whether the last messages sent were received at the other side, however, both parties get informed about the occurrence of this problem and may take appropriate recovery actions, possibly restarting the application.
- Unreliable in-order transmission: Messages are received by B in the same order as they were sent by A, but it may occur that they are delivered with transmission errors, or they are lost.
- Over a single physical link, such as a local area network or a transmission line, packets are never delivered out of order (since they are not stored in the network).
- Unreliable out-of-order transmission: Messages are not necessarily received by B in the same order as they were sent by A, and they may be delivered with transmission errors, or become lost.
- Out-of-order reception and message loss may occur when each message is sent as a separate IP packet over the Internet, for instance when UDP is used as the transport protocol.
- Some protocols have at most one message in transit in each direction of communication (as for instance the ABP). In such a case, out-of-order delivery is not an issue.
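When more than one message may be in transit, the usual remedy for out-of-order delivery is to number the messages and resequence them at the receiver. A minimal sketch (assuming, for simplicity, that no messages are lost):

```python
import heapq

def reorder(received):
    """Receiver-side resequencing: each message carries a sequence
    number; messages are delivered to the application in order, and
    those that arrive early are buffered until the gap is filled."""
    buffer, expected, delivered = [], 0, []
    for seq, payload in received:
        heapq.heappush(buffer, (seq, payload))
        # deliver every consecutive message starting at 'expected'
        while buffer and buffer[0][0] == expected:
            delivered.append(heapq.heappop(buffer)[1])
            expected += 1
    return delivered

# Packets sent as separate UDP datagrams may arrive out of order:
print(reorder([(1, "b"), (0, "a"), (2, "c")]))  # -> ['a', 'b', 'c']
```

This is essentially what a reliable transport protocol such as TCP does internally; combined with retransmission on time-out it also covers message loss.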
Component failures
Fault tolerance for Byzantine failures is very difficult (see for instance Wikipedia). For fail-safe failures, one usually uses an "are-you-alive" protocol: a neighbor sends from time to time a message "are you alive", which should be answered by "yes". If the answer does not arrive, the neighbor assumes that the component has failed.
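Such a failure detector can be sketched as a time-out on the last answer received (class and method names are invented for illustration):

```python
import time

class AreYouAliveMonitor:
    """Sketch of an 'are-you-alive' failure detector: the neighbor is
    declared failed if no 'yes' answer has arrived within the time-out
    period. This only detects fail-stop failures - a Byzantine
    component could keep answering 'yes' while producing wrong results."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_answer = time.monotonic()

    def record_answer(self):
        """Called whenever a 'yes' reply to 'are you alive' arrives."""
        self.last_answer = time.monotonic()

    def is_alive(self):
        return time.monotonic() - self.last_answer < self.timeout

monitor = AreYouAliveMonitor(timeout=0.05)
print(monitor.is_alive())   # -> True (answer just recorded)
time.sleep(0.1)
print(monitor.is_alive())   # -> False (time-out: component presumed failed)
```

Note that, as with any time-out, a slow network can make a live component look dead; choosing the period involves the same trade-off discussed for retransmission time-outs above.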
In the case of a two-party protocol, one usually does not consider component failures, that is, if such a failure occurs, the whole system becomes non-operational. However, there are multi-party protocols that recover from single or multiple component failures - for instance in the case of (a) load-sharing protocols between several servers, (b) peer-to-peer systems, (c) distributed databases, etc. - Note: some of the protocols proposed for the course projects are of this nature.
Created: October 30, 2014